Given the high relevance of COVID-19 in 2020 and the impact it has on the vast majority of individuals around the globe, staying informed regarding the current status and spread of COVID around the world has become a daily task for many people. While the severity of the situation necessitates daily monitoring and continuous tracking of the virus, generating enormous amounts of data each day, this also makes it easy for the general population to be overwhelmed by the overabundance of COVID-related data made available from the scientific community. When dealing with such the issue of big datasets, data visualization can serve the critical function of making large data more accessible and comprehensible to the public and the decision makers at large. In our project, we will explore a few of these different graphical concepts that may facilitate a better understanding of the large amounts of information and see how such concepts can turn simple numbers into a relevant narrative.
In this project, we use datasets from two different sources.
The first dataset, containing racial and ethnic data on COVID-19 cases in the United States, comes from a collaboration between COVID Tracking Project and Boston University Center for Antiracist Research. This racial COVID data include the number of cases and deaths for different races and ethnicities that are reported across all 50 states. For this first dataset, marginal histogram is used to visualize whether COVID-19 affects different races disproportionately across the US.
The second dataset comes from Our World in Data, which gathers COVID-19 data from various government agencies across the globe. In addition to reporting more basic measures such as the number of cases, deaths, hospitalization, testing, etc., this dataset also includes social measures such as the government stringency index and the human development index. The government stringency index is calculated by considering various restrictions the government imposes on the citizens such as wearing a mask in public, workplace closure, international travel ban, etc. The human development index is calculated by considering various measures of achievement in key dimensions of human development in the country such as life expectancy and standard of living.
With this dataset, we’ll first demonstrate the usage of bubble plots. Bubble plot allows plotting of more than 2 variables with the weight or size of the bubble serving as the 3rd axis, and will be used to visualize the relationship between these social measures and the number of COVID-19 deaths in a country.
Using the same dataset, we’ll also demonstrate how to create an interactive plot. Interactive plot allows users to manipulate the ways in which the data can be viewed. Some of the ways in which one can interact with the plot may be as simple as being able to zoom in and out of different sections of the plot, include or exclude certain variables in the plot, or click on a particular bar or point to see the actual value. We will walk through the implementation of these mentioned interactive functions into our plots. This interactive plot will be a bar graph showing the number of COVID-19 cases, deaths, hospitalizations, and testing numbers in 5 different Eastern European countries.
In this tutorial, we are using R, STATA, MATLAB and Python to show how these graphical concepts may be implemented.
| Software | Relevant Libraries |
|---|---|
| R | tidyverse (manipulation of data), ggplot2 (plotting), cowplot (plotting), ggplotly (adding interactive functionality) |
| STATA | the built-in command scatter and twoway histogram used to create the three subplots in marginal plots, and user-written command grc1leg is used to combine these subplots into one figure with a command legend. Scatter is also used to create the bubble plot. In the extended example of multi-categories histograms, user-written commands tabcount (calculate frequency over categories), bihist and byhist (plotting histograms) are also used. |
| Python | numpy (manipulation of data) , pandas (manipulation of data), matplotlib (plotting), seaborn (plotting), plotly (adding interactive functionality) |
| MATLAB | Clickable Legend Toolbox (adding interactive functionality) and subaxis (allow margin adjustment on subplot) from MATLAB File Exchange, plotly-graphing-library-for-matlab (adding interactive functionality) from plotly (bubblechart requires ). ** NOTE: bubblechart function used in plotting bubblechart requires MATLAB_R2020b. |
A scatter plot is one of the most frequently used visualizations. While it has the benefits of retaining the exact data values and sample size, it can be difficult to see the patterns behind those massive number of data points, such as how distributions vary along the axes and across different categories. The combination of scatter plots and marginal plots (usually histograms or box plots) allows us to exploit the benefits of both plot types and observe the connection between individual data points and their aggregation.
In this example, we use a scatter plot to display the # of cases and deaths across the four racial groups, and introduce marginal histograms of both axes to analyze the distribution.
# setup
library(tidyverse)filename_racial = "./Data/Race Data Entry - CRDT.csv"
racial_data = read_delim(
filename_racial, delim = ",",
col_types = cols(
.default = col_integer(),
Date = col_character(),
State = col_character()
)
) %>%
pivot_longer(
cols = starts_with(c("Cases", "Deaths")),
names_pattern = "(Cases|Deaths)_(.*)",
names_to = c(".value", "Race")
) %>%
filter(!str_detect(Race, "Total|Ethnicity"))
date_picked = "20201101"
race_picked = c("White", "Black", "LatinX", "Asian")
plot1_data = racial_data %>%
filter(
Date == date_picked,
str_detect(Race, paste0(race_picked, collapse = "|"))
) %>%
mutate(
Race = factor(Race, levels = race_picked, ordered = TRUE)
)| Date | State | Race | Cases | Deaths |
|---|---|---|---|---|
| 20201101 | AK | White | 5063 | 31 |
| 20201101 | AK | Black | 574 | 3 |
| 20201101 | AK | LatinX | NA | NA |
| 20201101 | AK | Asian | 680 | 8 |
| 20201101 | AL | White | 59056 | 1555 |
| Date | State | Race | Cases | Deaths |
|---|---|---|---|---|
| 0/224 | 0/224 | 0/224 | 51/224 | 54/224 |
plot1_title = sprintf(
"%s (%s)",
"Total confirmed COVID-19 deaths vs. cases, U.S. States",
format(as.Date(date_picked, "%Y%m%d"), "%m/%d/%y")
)
palette_picked = "Set1" # pick a color palette
# use `RColorBrewer::display.brewer.all()` to check out available palettes
pic1_main = plot1_data %>%
ggplot(aes(x = Cases, y = Deaths, color = Race)) +
theme_bw() +
geom_point(size = 2, alpha = .7) +
scale_color_brewer(palette = palette_picked) +
ggtitle(plot1_title) +
theme(
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 9)
)
print(pic1_main)pic1 = pic1_main +
theme(legend.position = "left") # move legend to the left side
# because ggMarginal() adds margin plots on the right side
# marginal histogram
ggExtra::ggMarginal(
pic1, type="histogram",
groupColour = TRUE, groupFill = TRUE
)We can create marginal box plots by setting type = "boxplot". The parameter size specifies the relative size of the main plot to the marginal ones. A size of 5 means that the main plot is 5x wider and 5x taller than the marginal plots.
ggExtra::ggMarginal(
pic1, type="boxplot", size = 5,
groupColour = TRUE, groupFill = TRUE
)Although ggMarginal() provides a easy solution to marginal plots, it is not friendly to customization. For example, we cannot set the two marginal plots to be of different types or of different sizes. Neither can we specify the exact size for marginal plots in units such as inch.
Check out the cowplot section for a more flexible solution to creating marginal plots in R.
Using package cowplot to create marginal plots is less succinct, but more flexible. Unlike ggExtra::ggMarginal, cowplot allows:
pic1_xhist = cowplot::axis_canvas(pic1_main, axis = "x") +
geom_histogram(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = Cases, fill = Race, color = Race),
bins = 30, alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked)
pic1_yhist = cowplot::axis_canvas(pic1_main, axis = "y", coord_flip = TRUE) +
geom_histogram(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = Deaths, fill = Race, color = Race),
bins = 30, alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked) +
coord_flip()
pic1_xhist
pic1_yhist{pic1_main +
theme(legend.position = "left") # move legend to the left
# to make room for margin plots
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist,
height = grid::unit(.8, "in"),
position = "top"
) %>%
cowplot::insert_yaxis_grob(
pic1_yhist,
width = grid::unit(.8, "in"),
position = "right"
) %>%
cowplot::ggdraw()We can move the marginal plots from the top/right side of the original figure to the bottom/left side.
To do that, we may want to flip the two previously created marginal histograms using scale_y_reverse(). To make room for the marginal plots, we can move the x-axis and y-axis to the top/right side of the original figure by setting the position parameter in scale_*_continuous().
{pic1_main +
# move the axis labels to the top/right to make room for margin plots
scale_x_continuous(position = "top") +
scale_y_continuous(position = "right")
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist + scale_y_reverse(),
height = grid::unit(.8, "in"),
position = "bottom"
) %>%
cowplot::insert_yaxis_grob(
pic1_yhist + scale_y_reverse(),
width = grid::unit(.8, "in"),
position = "left"
) %>%
cowplot::ggdraw()Using cowplot, we can easily specify different plot types for the x,y-axis marginal plots. Below is the code example to replace the y-axis marginal histogram with a box plot.
pic1_ybox = cowplot::axis_canvas(pic1_main, axis = "y") +
geom_boxplot(
data = plot1_data %>% filter(!is.na(Cases) & !is.na(Deaths)),
aes(x = 0, y = Deaths, fill = Race, color = Race),
alpha = .6
) +
scale_fill_brewer(palette = palette_picked) +
scale_color_brewer(palette = palette_picked)
{pic1_main +
theme(legend.position = "left")
} %>%
cowplot::insert_xaxis_grob(
pic1_xhist,
height = grid::unit(.8, "in"),
position = "top"
) %>%
cowplot::insert_yaxis_grob(
pic1_ybox,
width = grid::unit(.8, "in"),
position = "right"
) %>%
cowplot::ggdraw()We can even call cowplot::insert_yaxis_grob() twice to add multiple margin plots for the y axis (although probably unnecessary).
In the STATA example, we will not be using stacked histograms grouped by variable race like we do in the other examples. That is because:
graph bar cannot line up with the scatter plot, even if we explicitly set the x,y-axis range to be the same, while simple histograms generated using twoway histogram can line up with the scatter plot.cd "E:\git\Stats506_midterm_project\STATA\"
import delimited "..\Data\Race Data Entry - CRDT.csv", ///
stringcols(1) numericcols(3/28) clear
keep if date == "20201101"
keep state *white *black *latinx *asian
reshape long cases_ deaths_, i(state) j(race) string
rename cases_ cases
rename deaths_ deaths
// turn race into a ordered factor
gen race_int = 1
replace race_int = 2 if race == "black"
replace race_int = 3 if race == "latinx"
replace race_int = 4 if race == "asian"
label define race_label 1 "White" 2 "Black" 3 "LatinX" 4 "Asian"
label values race_int race_label
drop race
rename race_int race
// Drop observations with cases or deaths values being NA
// Keeping these NA values will not affect scatter plot -- either axis being
// NA, the scatter won't show. But it does affect histograms since they will
// be based on a single variable (cases or deaths)
drop if cases == . | deaths == .
// categorize deaths (y-axis variable) into four groups by race
separate deaths, by(race) veryshortlabel
save race_plot_data, replaceWe will be Using the user written command grc1leg by Vince Wiggins to create the marginal plots.
grc1leg. Combine graphs into one graph with a common legend. / Program by Vince Wiggins, StataCorp vwiggins@stata.com. / Statalist distribution, 16 June 2003. / Distribution-Date: 02jun2010
Use the following line to install grc1leg:
net install grc1leg, from(http://www.stata.com/users/vwiggins/)Note:
- Run the code chunk from ‘local alpha’ to ‘twoway histogram cases’ at once to make sure the local variables can be used by all three subplots.
- The optacity settings (e.g.
mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the optionmsymbol(Oh).
use race_plot_data, clear
// set opacity
local alpha %30
// set x,y-axis limits for both the main plot and the marginal plots
// otherwise the axis scales might differ across the three plots
local xscale_opt xscale(range(0 415000))
local yscale_opt yscale(range(0 10500))
// create the main plot
// set color for the four racial groups
local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
`xscale_opt' `yscale_opt' ///
legend(subtitle("Race") position(9) cols(1) region(style(none))) ///
ytitle("Deaths") xtitle("Cases") xlabel(, grid) ///
saving(yx_tr, replace)
// create the two marginal histograms
local hist_opt color(navy`alpha') bin(30)
twoway histogram deaths, `hist_opt' fraction ///
`yscale_opt' ysca(alt) horiz fxsize(25) ///
ytitle("") xlabel(, nogrid) ///
saving(hy_tr, replace)
twoway histogram cases, `hist_opt' fraction ///
`xscale_opt' xsca(alt) fysize(25) ///
xtitle("") xlabel(, grid) ylabel(, nogrid) ///
saving(hx_tr, replace)
// group the three plots together using 'grc1leg'
// 'colfirst' asks that the three plots will be displayed down column first
// 'legendfrom(yx_tr.gph)' specify using the main plot's legend
// 'position(9)' means putting the legend at 9 o'clock position
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
"U.S. States (11/01/20)"
grc1leg hx_tr.gph yx_tr.gph hy_tr.gph, ///
colfirst hole(3) imargin(0 0 0 0) ///
title("`graph_title'", size(medium)) ///
legendfrom(yx_tr.gph) position(9)
// save the final output
graph export stata-marginplot-tr.png, replace
// remove the intermediate files from disk
erase hx_tr.gph
erase yx_tr.gph
erase hy_tr.gph// create the main plot
local alpha %30
local xscale_opt xscale(range(0 415000))
local yscale_opt yscale(range(0 10500))
local scatter_opt mcolor(red`alpha' blue`alpha' green`alpha' purple`alpha')
scatter deaths? cases, `scatter_opt' ///
`xscale_opt' `yscale_opt' ///
legend(subtitle("Race") position(3) cols(1) region(style(none))) ///
ytitle("Deaths") xtitle("Cases") ///
xlabel(, grid) ysca(alt) xsca(alt) ///
saving(yx_bl, replace)
// create the two marginal histograms
local hist_opt color(navy%30) bin(30)
twoway histogram deaths, `hist_opt' fraction ///
`yscale_opt' xsca(alt reverse) horiz fxsize(25) ///
ytitle("") xlabel(, nogrid) ///
saving(hy_bl, replace)
twoway histogram cases, `hist_opt' fraction ///
`xscale_opt' ysca(alt reverse) fysize(25) ///
xtitle("") xlabel(, grid) ylabel(, nogrid) ///
saving(hx_bl, replace)
// group the three plots together using 'grc1leg'
// unlike the top-right marginal plot, we do not specify'colfirst' this time
// so that the three plots will be displayed along row first
// legendfrom(yx_bl.gph) specify using the main plot's legend
// position(3) means putting the legend at 3 o'clock position
local graph_title = "Total confirmed COVID-19 deaths vs. cases, " + ///
"U.S. States (11/01/20)"
grc1leg hy_bl.gph yx_bl.gph hx_bl.gph, ///
hole(3) imargin(0 0 0 0) ///
title("`graph_title'", size(medium)) ///
legendfrom(yx_bl.gph) position(3)
// save the final output
graph export stata-marginplot-bl.png, replace
// remove the intermediate files from disk
erase hy_bl.gph
erase yx_bl.gph
erase hx_bl.gphAs mentioned above, stacked histograms can be difficult to generate in STATA, and they are not suitable as marginal plots since they cannot properly line up with the main scatter plot.
However, as an extended example, we will walk through the code to create a stacked histogram of variable cases grouped by race, using the racial data race_plot_data.dta in the marginal plot examples and the graph bar command.
cd "E:\git\Stats506_midterm_project\STATA\"
use race_plot_data, clear
// check out the data format
list state race cases in 1/5
// check out the summary statistics for variable 'cases'
summarize cases . cd "E:\git\Stats506_midterm_project\STATA\"
E:\git\Stats506_midterm_project\STATA
. use race_plot_data, clear
. // check out the data format
. list state race cases in 1/5
+-----------------------+
| state race cases |
|-----------------------|
1. | AK Asian 680 |
2. | AK Black 574 |
3. | AK White 5063 |
4. | AL Asian 628 |
5. | AL Black 38319 |
+-----------------------+
. // check out the summary statistics for variable 'cases'
. summarize cases
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
cases | 165 32516.75 51177.92 74 401503
Since the graph bar command is not designed for histograms, we need to manually turn the continuous variable cases into intervals and store as a categorical variable cut_cases. After that, we can calculate the frequency of each intervals and produce the histogram.
From the output of summarize cases, we know that values of cases falls in range [74, 401503]. Therefore, we can cut it into 30 intervals from 0 to 402000.
Note: since R’s library Statamarkdown runs each STATA code chunk as an stand-alone .do file, the following code chunks will contain some repetitive code.
cd "E:\git\Stats506_midterm_project\STATA\"
use race_plot_data, clear
// use 'egen ... cut' with option 'at' to turn 'cases' into intervals
egen cut_cases = cut(cases), at(0(13400)402000)
list cases cut_cases in 1/10 . cd "E:\git\Stats506_midterm_project\STATA\"
E:\git\Stats506_midterm_project\STATA
. use race_plot_data, clear
. // use 'egen ... cut' with option 'at' to turn 'cases' into intervals
. egen cut_cases = cut(cases), at(0(13400)402000)
. list cases cut_cases in 1/10
+------------------+
| cases cut_ca~s |
|------------------|
1. | 680 0 |
2. | 574 0 |
3. | 5063 0 |
4. | 628 0 |
5. | 38319 26800 |
|------------------|
6. | 59056 53600 |
7. | 1130 0 |
8. | 22837 13400 |
9. | 67721 67000 |
10. | 2745 0 |
+------------------+
Interpretation of cut_cases: value 0 represents the range [0, 13400), value 13400 represents the range [13400, 26800), etc.
We then use tabcount to calculate the number of observations for each level of cut_cases and race. Install tabcount using ssc install tabcount if you have not already done that.
cd "E:\git\Stats506_midterm_project\STATA\"
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)
quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
list in 1/20 . cd "E:\git\Stats506_midterm_project\STATA\"
E:\git\Stats506_midterm_project\STATA
. use race_plot_data, clear
. egen cut_cases = cut(cases), at(0(13400)402000)
.
. quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
. list in 1/20
+--------------------------+
| _freq cut_ca~s race |
|--------------------------|
1. | 11 0 White |
2. | 5 13400 White |
3. | 4 26800 White |
4. | 4 40200 White |
5. | 7 53600 White |
|--------------------------|
6. | 4 67000 White |
7. | 2 80400 White |
8. | 3 93800 White |
9. | 2 107200 White |
10. | 3 120600 White |
|--------------------------|
11. | 1 134000 White |
12. | 1 147400 White |
13. | 0 160800 White |
14. | 1 174200 White |
15. | 0 187600 White |
|--------------------------|
16. | 0 201000 White |
17. | 1 214400 White |
18. | 0 227800 White |
19. | 0 241200 White |
20. | 0 254600 White |
+--------------------------+
It is important to use tabcount here rather than a simple bysort command (see the following code chunk). tabcount cut_cases race, v1(0(13400)402000) v2(1/4) will preserve every intervals of cases from [0, 13400) to [388600, 402000), even if there is no observation in that range. To create a histogram, we will need the zero-frequency intervals.
On the other hand, bysort only keeps the levels that actually appear in the dataset (with non-zero frequencies), which means the intervals of cases without data points will not be preserved.
cd "E:\git\Stats506_private\group_proj\STATA\"
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)
// DO NOT USE THIS METHOD for the histogram!
bysort cut_cases race: generate freq = _N
keep cut_cases race freq
duplicates drop
list . cd "E:\git\Stats506_private\group_proj\STATA\"
E:\git\Stats506_private\group_proj\STATA
. use race_plot_data, clear
. egen cut_cases = cut(cases), at(0(13400)402000)
.
. // DO NOT USE THIS METHOD for the histogram!
. bysort cut_cases race: generate freq = _N
. keep cut_cases race freq
. duplicates drop
Duplicates in terms of all variables
(134 observations deleted)
. list
+--------------------------+
| race cut_ca~s freq |
|--------------------------|
1. | White 0 11 |
2. | Black 0 25 |
3. | LatinX 0 6 |
4. | Asian 0 45 |
5. | White 13400 5 |
|--------------------------|
6. | Black 13400 8 |
7. | LatinX 13400 4 |
8. | White 26800 4 |
9. | Black 26800 5 |
10. | LatinX 26800 7 |
|--------------------------|
11. | Asian 26800 1 |
12. | White 40200 4 |
13. | Black 40200 6 |
14. | LatinX 40200 2 |
15. | White 53600 7 |
|--------------------------|
16. | Black 53600 1 |
17. | White 67000 4 |
18. | LatinX 67000 1 |
19. | White 80400 2 |
20. | White 93800 3 |
|--------------------------|
21. | Black 93800 1 |
22. | LatinX 93800 1 |
23. | White 107200 2 |
24. | Black 107200 1 |
25. | White 120600 3 |
|--------------------------|
26. | White 134000 1 |
27. | White 147400 1 |
28. | White 174200 1 |
29. | White 214400 1 |
30. | LatinX 227800 1 |
|--------------------------|
31. | LatinX 388600 1 |
+--------------------------+
With the frequency table generated using tabcount, we can now use command graph bar (asis) _freq, over(race) over(cut_cases) asyvar stack to create the stacked histograms. Option asyvar asks that the frequencies be stacked over race – the first over() group.
Since the tabcount command with option replace will replace the current dataset with the frequency table, we can use the preserve and restore commands to create a snapshot of the current data and switch back to it later.
cd "E:\git\Stats506_private\group_proj\STATA\"
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)
preserve
quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
graph bar (asis) _freq, over(race, sort(race) descending) ///
over(cut_cases, gap(0) label(nolabels)) asyvar stack
graph export stata-stacked_histogram-graph_bar.png, replace
restoreWe can further specify the colors for each racial group:
cd "E:\git\Stats506_private\group_proj\STATA\"
use race_plot_data, clear
egen cut_cases = cut(cases), at(0(13400)402000)
preserve
quietly tabcount cut_cases race, v1(0(13400)402000) v2(1/4) replace
local alpha %40
graph bar (asis) _freq, over(race, sort(race) descending) ///
over(cut_cases, gap(0) label(nolabels)) asyvar stack ///
bar(1, color(red`alpha')) bar(2, color(blue`alpha')) ///
bar(3, color(green`alpha')) bar(4, color(purple`alpha'))
graph export stata-stacked_histogram-graph_bar-colored.png, replace
restoreApart from the stacked histogram, there are two much simpler methods to add another dimension of information in histograms: (i) using command bihist to create histograms with exactly two categories; and (ii) using command byhist to create interlaced histograms with two or more categories.
These user written commands are designed for histograms, which saves us the trouble of computing frequency over intervals. Therefore, the code is much more succinct and easy to remember.
Install bihist and byhist using ssc install bihist and ssc install byhist if you have not already done that.
use race_plot_data, clear
bihist cases if race == 1 | race == 2, by(race)use race_plot_data, clear
byhist cases, by(race)Prior to any plotting, we’ll organize the csv file so it only contains the relevant data necessary for plotting.
%% Date Prep
% Read in data
fileName = 'Race Data Entry - CRDT.csv';
raceData = readtable(fileName);
%Select only 11/01/20 data
newestDate = 20201101;
raceData = raceData(raceData.Date == newestDate,:);
% Convert all case/death # to double
colnames = raceData.Properties.VariableNames;
for i = 3:length(colnames)
if iscell(raceData.(genvarname(colnames{i})))
raceData.(genvarname(colnames{i})) = ...
str2double(raceData.(genvarname(colnames{i})));
end
endLet’s narrow down the dataset to only look at 4 different racial groups: White, Black, Latinx, and Asian since including all available racial categories would provide too much information on a single figure.
%% Get race info
% Grabbing the different races identified in dataset
race_idx = find(contains(colnames, 'Deaths'));
race_idx = race_idx(2 : 10); % remove total and ethnicity info
races = raceData(end, race_idx);
raceID = strrep(races.Properties.VariableNames, 'Deaths_','');
% selecting relevant columns
plotData = raceData(1 : end, {'Date', 'State', 'Cases_White', ...
'Cases_Black', 'Cases_LatinX', 'Cases_Asian', 'Deaths_White', ...
'Deaths_Black', 'Deaths_LatinX', 'Deaths_Asian'});
% reshape case variable for plotting
case_col = startsWith(plotData.Properties.VariableNames, 'Cases');
cases = table2array(plotData(:, case_col));
cases = reshape(cases, [], 1);
% reshape death variable for plotting
death_col = startsWith(plotData.Properties.VariableNames, 'Deaths');
deaths = table2array(plotData(:, death_col));
deaths = reshape(deaths, [], 1);MATLAB’s scatterhist requires that we pass in a group label that is the same dimension as the x and y datapoints being plotted, so we create a new array, race_label, containing matched labels.
% Generate race labels matching the column vector for cases/deaths
race_label = [];
for i = 1 : 4
race_label = [race_label; repmat(raceID(i), size(raceData,1), 1)];
endNow that all the variables are ready, we use the scatterhist function to create a scatter plot with histogram plotted along the margins. Here, we specify group label colors to match colors used in R plot. We also specify the location where we want the histograms to be plotted with respect to the scatter plot using argument pair (‘Location’, ‘NorthEast’) and the direction that we want the histogram to point by argument pair (‘Direction’, ‘out’).
x = cases;
y = deaths;
% Specify color to be used for each race grouping
red = [236, 95, 97] / 255;
blue = [115, 165, 205] / 255;
green = [131, 199, 129] / 255;
purple = [173, 113, 182] / 255;
c_array = [red; blue; green; purple];
figure
% scatterhist does not allow us to adjust the transparency of the plots,
% to do this we can generate marginal histogram from scratch using subplot
h = scatterhist(x, y, 'Group', race_label, 'Style', 'bar', ...
'marker', '.', 'Location', 'NorthEast', 'Direction','out', ...
'color', c_array, 'MarkerSize', 16);
[leg,~] = legend('show');
title(leg,'Race')
xlabel('Cases')
ylabel('Deaths')
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)') While the built-in function scatterhist is useful for quick plotting because many properties of the figure and the layout do not have to be specified, this function faces the downside of not being able to customize various settings which the user may be looking to specify. Alternatively to scatterhist, we may use subplot functionality to get around this issue by manually creating a marginal histogram by combining the 3 components (scatterplot + 2 histograms). Please refer to the section ‘subplot’ for the walkthrough.
In MATLAB, we can also manually create a marginal histogram by arranging together 2 histograms and a scatterplot into a single figure. Because we’re working with multiple, separate functions like hist and scatter with distinct plot handles for each group, we have much larger customizability of the arranged plots, whereas scatterhist prevents users from grabbing all of the figure’s property handles and modifying them, like transparency and bin width specific to each histogram.
We’ll be using the toolbox ‘subaxis’ which builds off the built-in subplot functionality in order to minimize the margins between each subplots. This toolbox can be directly downloaded from link but it is also included in the repository containing this tutorial.
addpath('subaxis')Working off the preprocessed data from earlier steps above, we now have to manually plot a histogram for each race and overlay them on top of one another. We do this two times for each of the two histograms (deaths and cases) that’s needed for a marginal histogram.
subaxis functions allows us to arrange multiple plots into a single figure and additionally allows us to specify the margins surrounding each subplot.
%% Using subplot to create marginal histogram
% breaking race data into separate arrays
white = find(race_label == "White");
black = find(race_label == "Black");
latinx = find(race_label == "LatinX");
asian = find(race_label == "Asian");figure(1)
clf
% y-data histogram
ah1 = subaxis(2, 2, 4, 'sh', 0, 'sv', 0.01, 'padding', 0, 'margin', 0.1);
p1 = histogram(deaths(white), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', red);
hold on
p2 = histogram(deaths(black), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', blue);
hold on
p3 = histogram(deaths(latinx), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', green);
hold on
p4 = histogram(deaths(asian), 'Orientation', 'horizontal', ...
'Normalization', 'probability', 'BinWidth', 1000, ...
'Facecolor', purple);
% x-data histogram
ah2 = subaxis(2, 2, 1,'sh', 0.03, 'sv', 0.03, 'padding', 0, 'margin', 0.1);
histogram(cases(white), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', red)
hold on
histogram(cases(black), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', blue)
hold on
histogram(cases(latinx), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', green)
hold on
histogram(cases(asian), 'Normalization', 'probability', ...
'BinWidth', 20000, 'Facecolor', purple)The transparency of the scatter points can now be specified because this feature is supported through built-in function scatter, using the argument Markerfacealpha. I have specified 60% (denoted as 0.6) opacity for all the points.
% scatterplot
ah3 = subaxis(2, 2, 3, 'sh', 0.03, 'sv', 0.01, ...
'padding', 0, 'margin', 0.1);
hold on
scatter(cases(white), deaths(white), 'filled', ...
'MarkerFaceColor', red, 'Markerfacealpha', 0.6)
scatter(cases(black), deaths(black), 'filled', ...
'MarkerFaceColor', blue, 'Markerfacealpha', 0.6)
scatter(cases(latinx), deaths(latinx), 'filled', ...
'MarkerFaceColor', green, 'Markerfacealpha', 0.6)
scatter(cases(asian), deaths(asian), 'filled', ...
'MarkerFaceColor', purple, 'Markerfacealpha', 0.6)We need to remove all the boxes and axes information from the histograms since we’re combining them with the scatterplot. Similar to function scatterhist‘s argument ’Direction’ specifying the direction which the histogram plots, View property of the subplots can be adjusted to rotate the histograms in the desired direction (commented out in the code below ah1.View = [180, -90]). Similarly, the arrangement of the subplot is reponsible for specifying the location of the histograms, whether they’re to the top or bottom, and left or right of the scatter plot.
% Remove boxes and axes information from the histogram
linkaxes([ah1, ah3], 'y')
linkaxes([ah3, ah2], 'x')
ah1.Box = 'off';
% ah1.View = [180, -90];
ah1.Visible = 'off';
ah2.Visible = 'off';
ah2.Box = 'off';
% ah2.View = [0, -90];We finish off by creating a single universal legend for all three plots. Since this will not happen automatically when manually plotting and arranging a marginal histogram.
% Create single universal legend
h = [p1;p2;p3;p4];
hold on
lgd = legend(h, 'White', 'Black', 'Latinx', 'Asian');
lgd.Box = 'off';
newPosition = [0.5 0.6 0.13 0.13];
newUnits = 'normalized';
set(lgd, 'Position', newPosition, 'Units', newUnits);
title(lgd, 'Races')
% Give title and label axis in scatterplot
sgtitle('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/03/20)')
xlabel('Cases')
ylabel('Deaths')import numpy as np
import pandas as pd
import matplotlib.patches as mpatches
import matplotlib.pyplot as pltBefore plotting, we organize the csv file Race Data Entry - CRDT.csv which contains the data we need for plotting the marginal histogram. We set the data to be Nov 1, 2020 which stays the same with other parallel examples.
df = pd.read_csv('./Data/Race Data Entry - CRDT.csv')
new = df[df['Date'] == 20201101]To avoid too much information showed in a single plot, we are focusing on four different racial groups: White, Black, LatinX, and Asian.
new = new[['Cases_Asian', 'Cases_Black', 'Cases_LatinX','Cases_White',
'Deaths_Asian', 'Deaths_Black', 'Deaths_LatinX', 'Deaths_White']]
new = new.dropna(axis=0, how='any') # drop rows that have any NaN
x1 = new['Cases_Asian']
y1 = new['Deaths_Asian']
x2 = new['Cases_Black']
y2 = new['Deaths_Black']
x3 = new['Cases_LatinX']
y3 = new['Deaths_LatinX']
x4 = [float(x) for x in new['Cases_White']]
y4 = new['Deaths_White']Here we specify some important variables which formulate our further plotting. These variables (width, length, spacing, bottom, etc) determine the size of main-plot and marginal-plots.
left, width = 0.1, 0.65
bottom, height = 0.1, 0.65
spacing = 0.005
rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]
xmax = max(np.max(x1), np.max(x2), np.max(x3))
ymax = max(np.max(y1), np.max(y2), np.max(y3))
binsx = np.arange(0, xmax + 1000, 9000)
binsy = np.arange(0, ymax + 200, 200)matplotlib We can divide the plotting into 3 steps:
fig = plt.figure(figsize=(11, 8))
ax = fig.add_axes(rect_scatter)
plt.xlabel('Cases');
plt.ylabel('Deaths');
plt.title('Total confirmed COVID-19 deaths vs. cases, U.S. States (11/01/20)',
x = 0.6, y = -0.16, fontsize = 14);
plt.grid(color='LightGrey')
purple_patch = mpatches.Patch(color='purple', label='Asian')
blue_patch = mpatches.Patch(color='blue', label='Black')
green_patch = mpatches.Patch(color='green', label='LatinX')
red_patch = mpatches.Patch(color='red', label='White')
ax_histx = fig.add_axes(rect_histx, sharex=ax)
ax_histy = fig.add_axes(rect_histy, sharey=ax)
plt.legend(handles=[red_patch, blue_patch, green_patch, purple_patch],
title='Race', loc = 'best');
ax.scatter(x1, y1, color = 'purple', alpha = 0.5);
ax_histx.hist(x1, bins=binsx, color = 'purple', alpha = 0.5);
ax_histy.hist(y1, bins=binsy, orientation='horizontal',
color = 'purple', alpha = 0.5);
ax.scatter(x2, y2, color = 'blue', alpha = 0.5);
ax_histx.hist(x2, bins=binsx, color = 'blue', alpha = 0.5);
ax_histy.hist(y2, bins=binsy, orientation='horizontal',
color = 'blue', alpha = 0.5);
ax.scatter(x3, y3, color = 'green', alpha = 0.5);
ax_histx.hist(x3, bins=binsx, color = 'green', alpha = 0.5);
ax_histy.hist(y3, bins=binsy, orientation='horizontal',
color = 'green', alpha = 0.5);
ax.scatter(x4, y4, color = 'red', alpha = 0.5);
ax_histx.hist(x4, bins=binsx, color = 'red', alpha = 0.5);
ax_histy.hist(y4, bins=binsy, orientation='horizontal',
color = 'red', alpha = 0.5);
plt.show()
# plt.savefig("python_marginal.png", dpi=200, bbox_inches='tight')In our project, we use bubble plot to see the relationship between stringency index, human development index, and total # of deaths in each country. Each bubble represents a single country with x-axis measuring the stringency index, y-axis measuring the human development index, and the size of the bubble measuring the total number of deaths in the country at the time of 10/20/20. Bubble plot is useful because it allows us to easily visualize 3 dimensions of the data rather than the conventional 2.
# setup
library(tidyverse)filename_covid = "./Data/owid-covid-data.csv"
plot2_data = read_delim(filename_covid, delim = ",") %>%
select(date, iso_code, continent, total_deaths_per_million,
stringency_index, human_development_index) %>%
filter(date == "2020-10-20")| date | iso_code | continent | total_deaths_per_million | stringency_index | human_development_index |
|---|---|---|---|---|---|
| 2020-10-20 | ABW | North America | 318.453 | 50.46 | NA |
| 2020-10-20 | AFG | Asia | 38.455 | 16.67 | 0.498 |
| 2020-10-20 | AGO | Africa | 7.515 | 71.30 | 0.581 |
| 2020-10-20 | AIA | North America | NA | NA | NA |
| 2020-10-20 | ALB | Europe | 157.759 | 42.59 | 0.785 |
| date | iso_code | continent | total_deaths_per_million | stringency_index | human_development_index |
|---|---|---|---|---|---|
| 0/214 | 1/214 | 2/214 | 23/214 | 42/214 | 33/214 |
plot2_title = paste0(
"Total # of Deaths Across Gov. Stringency Index",
" and Human Development Index"
)
plot2_data %>%
filter(!is.na(continent)) %>% # remove 'NA' from the plot's legend
ggplot(aes(x = stringency_index, y = human_development_index,
size = total_deaths_per_million, color = continent)) +
theme_bw() +
geom_point(alpha = .5) +
scale_size_continuous(
name = "Total Death (Per Mil.)",
breaks = seq(100, 1300, 300),
range = c(2, 14)
) +
scale_color_brewer(
name = "Continent",
palette = "Set1"
) +
scale_x_continuous(
name = "Stringency Index",
breaks = seq(0, 100, 10)
) +
scale_y_continuous(
name = "Human Development Index",
breaks = seq(0, 1, 0.1)
) +
ggtitle(plot2_title) +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 10)
) +
guides(color = guide_legend(override.aes = list(size = 3))) # enlarge the symbol size in the color legendcd "E:\git\Stats506_midterm_project\STATA\"
import delimited "..\Data\owid-covid-data.csv", clear
// keep only relevant variables & data points
keep if date == "2020-10-20"
keep iso_code continent total_deaths_per_million ///
stringency_index human_development_index
drop if continent == "" //drop international world data points
// drop observations if any of the index measures are missing
drop if stringency_index ==. | ///
human_development_index==. | ///
total_deaths_per_million ==.
save covid_plot_data, replaceNote: The optacity settings (e.g.
mcolor(red%30)) are only available in STATA15 or above. For STATA14 or below, try using hollow circles instead of solid ones by adding the optionmsymbol(Oh).
use covid_plot_data, clear
// categorize the human_development_index into six groups by continents
separate human_development_index, by(continent) veryshortlabel
local vars human_development_index? stringency_index ///
[aw = total_deaths_per_million]
local symbol O // solid circles
local size .75 // make all points smaller (75% of the original size)
local alpha %30
local options xscale(range(0 80)) yscale(range(0.32 1)) ///
mcolor(red`alpha' blue`alpha' green`alpha' ///
purple`alpha' orange`alpha' yellow`alpha') ///
msymbol(`symbol' `symbol' `symbol' `symbol' `symbol' `symbol') ///
msize(`size' `size' `size' `size' `size' `size')
local graph_title = ("Total # of Deaths Across Gov. Stringency Index " + ///
"and Human Development Index")
scatter `vars', `options' ///
legend(subtitle("Continent") position(3) cols(1) ///
region(style(none))) ///
title("`graph_title'", size(*.7)) ///
ytitle("Human Development Index") xtitle("Government Stringency Index") ///
yscale(titlegap(3)) // enlarge the gap between y-axis title and ticks
// to avoid overlap
// save the plot
graph export stata-bubbleplot.png, replaceBubble plot is a relatively new function (bubblechart) introduced to MATLAB in R2020b. We’ll need a minimum version of R2020b to be able to run this tutorial.
%% BubblePlots
% For ref see: https://www.mathworks.com/help/matlab/ref/bubblechart.html
%% Data Prep
fileName = './Data/owid-covid-data.csv';
covid_orig = readtable(fileName);
% Select 2020-10-20 Data
newestDate = datetime([2020 10 20]);
covid = covid_orig(covid_orig.date == newestDate,:);Checking the # of NaN values for the stringency index:
sum(isnan(covid.stringency_index))ans = 40
Checking the # of NaN values for the human development index:
sum(isnan(covid.human_development_index))ans = 33
% Variables of interest (VOI)
voi = {'location', 'continent', 'total_cases_per_million', ...
'total_deaths_per_million','stringency_index',...
'human_development_index'};
% Remove international and world info and keep only variables of interest
covid = covid(1:end-2, voi);We want to group the countries by its continent membership to see if there’re any patterns to certain continents being more stringent in COVID-19 restrictions or if certain continents are being impacted more heavily with COVID-19 deaths.
% Convert continent label to color label to be used in bubble plot
for k = 1 : size(covid, 1)
if strcmp(covid.continent(k), 'Africa')
covid.continent{k} = 1;
elseif strcmp(covid.continent(k), 'Asia')
covid.continent{k} = 2;
elseif strcmp(covid.continent(k), 'Europe')
covid.continent{k} = 3;
elseif strcmp(covid.continent(k), 'North America')
covid.continent{k} = 4;
elseif strcmp(covid.continent(k), 'Oceania')
covid.continent{k} = 5;
elseif strcmp(covid.continent(k), 'South America')
covid.continent{k} = 6;
end
end
covid = sortrows(covid, 2);
x = table2array(covid(:,5)); % x-axis: Stringency index
y = table2array(covid(:,6)); % y-axis: Human development index
sz = table2array(covid(:,4)); % size: Total deaths per million
c = cell2mat(table2array(covid(:,2))); % color: continentIn MATLAB, to create a bubble chart with different grouping colors, different groups have to be plotted separately and then overlaid on top of one another. We can pass in an array of RGB values into bubblechart as the third argument to specify the color we want the bubbles to be for each group.
% Stringency index of countries grouped by continent
x1 = x(c==1); x2 = x(c==2); x3 = x(c==3);
x4 = x(c==4); x5 = x(c==5); x6 = x(c==6);
% Human development index of countries grouped by continent
y1 = y(c==1); y2 = y(c==2); y3 = y(c==3);
y4 = y(c==4); y5 = y(c==5); y6 = y(c==6);
% Total Deaths per mil. of countries grouped by continent
sz1 = sz(c==1); sz2 = sz(c==2); sz3 = sz(c==3);
sz4 = sz(c==4); sz5 = sz(c==5); sz6 = sz(c==6);
%% Plot data in a tiled chart layout
figure
% colors for plotting
red = [234, 82, 84]/255;
blue = [105, 158, 201]/255;
green = [99, 185, 96]/255;
purple = [177, 122, 186]/255;
orange = [255, 186, 117]/255;
yellow = [255, 255, 153]/255;
t = tiledlayout(1,1);
nexttile
bubblechart(x1, y1, sz1, red) % Africa
hold on
bubblechart(x2, y2, sz2, blue) % Asia
hold on
bubblechart(x3, y3, sz3, green) % Europe
hold on
bubblechart(x4, y4, sz4, purple) % N. America
hold on
bubblechart(x5, y5, sz5, orange) % Oceania
hold on
bubblechart(x6, y6, sz6, yellow) % S. America
hold off
sgtitle('Total # of Deaths Across Gov. Stringency Index and Human Development Index')
xlabel('Stringency Index')
ylabel('Human Development Index')We want to include both the legend for the continents and the legend for the size of the bubbles (bubblelegend). In order to include these 2 separate legends, we have to specify a layout for the tile.
blgd = bubblelegend('Total Death (Per Mil.)','Location','eastoutside');
lgd = legend('Africa','Asia', 'Europe', 'North America', ...
'Oceania', 'South America');
title(lgd, 'Continent')
blgd.Layout.Tile = 'east';
lgd.Layout.Tile = 'east';While the default value of the transparency of the bubbles is set at 60% opacity, we’ll bring it down to 40% so it’s easier to see all the points since we have a large number of data points.
b1.MarkerFaceAlpha = 0.4; b2.MarkerFaceAlpha = 0.4;
b3.MarkerFaceAlpha = 0.4; b4.MarkerFaceAlpha = 0.4;
b5.MarkerFaceAlpha = 0.4; b6.MarkerFaceAlpha = 0.4;In the Python part, we make implement Bubble Plot using two different packages: seaborn and matplotlib. These two packages make plots by different ways which worth us to have a comparison between them.
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as snsdf2 = pd.read_csv("./Data/owid-covid-data.csv")
new2 = df2[df2['date'] == '2020-10-20']
new2 = new2[['continent', 'total_deaths_per_million',
'stringency_index', 'human_development_index']]
new2 = new2.dropna(axis=0, how='any')In Seaborn, we are going to use the seaborn.scatterplot function to implement the bubble plot. The size parameter takes the value of total_deaths_per_million which indicates the radius of each bubble. Here we set the transparency indicator alpha to be 40% to help us visualizing bubbles which are overlapped.
# draw plot using seaborn
plt.figure(figsize=(11, 7));
sns.scatterplot(x='stringency_index',
y='human_development_index',
size=new2['total_deaths_per_million'],
sizes = (10, 1250),
alpha=0.6,
hue=new2['continent'],
data=new2);
# Put the legend out of the figure
plt.legend(bbox_to_anchor=(1.03, 1),borderaxespad=0,
prop={'size':12}, handlelength=2,
frameon=False);
plt.rcParams.update({'legend.labelspacing':1.7});
plt.xlabel('Stringency Index');
plt.ylabel('Human Development Index');
plt.title('Total # of Death Across Stringency Index & Human Development Index',
fontsize = 13);
plt.grid(color='LightGrey');
plt.tight_layout();
plt.show()
# plt.savefig('python_bubble_seaborn.png', dpi=200, bbox_inches='tight')Here we use the matplotlib.scatter function to implement the bubble plot. Different from Seaborn, we use a loop here to combine continents with colors.
continent_list = ['Africa', 'Asia', 'Europe', 'North America', 'Oceania', 'South America']
color_list = ['red', 'blue', 'green', 'purple', 'orange', 'yellow']
colors = {'Africa':'red', 'Asia':'blue', 'Europe':'green',
'North America':'purple', 'Oceania': 'orange',
'South America': 'yellow'}
l = []
for i in range(6):
l.append(mpatches.Patch( color=color_list[i],
alpha=0.4,
label=continent_list[i]))
# make the plot
plt.figure(figsize = (9,8))
plt.scatter('stringency_index',
'human_development_index',
s='total_deaths_per_million',
c=new2['continent'].apply(lambda x: colors[x]),
alpha=0.4,
data=new2)
# set labels, title, grid, and layout
plt.grid(color='LightGrey')
plt.xlabel('Stringency Index')
plt.ylabel('Human Development Index')
plt.title('Total # of Death Across Stringency Index & Human Development Index',
fontsize = 13)
plt.legend(title = 'Continent', handles=l, loc = (1.01,0.57))
plt.show()
# plt.savefig("python_bubble_matplotlib.png", dpi=200, bbox_inches='tight')Interactive plots allow us to manipulate the viewing of the data points plotted and can be useful in focusing on specific aspect of the data. In this example, we will plot an interactive histogram comparing the # of cases, deaths, tests performed, and hospitalized across a few different countries in Europe (dataset). Clicking the legend in the plot will either hide or show the bar associated with the specific item in the legend.
# setup
library(tidyverse)filename_covid = "./Data/owid-covid-data.csv"
plot3_data = read_delim(
filename_covid, delim = ",",
col_types = cols( # specify the types for our variables of interest
# otherwise some will be read in as 'logical' and lead to incorrect values
date = col_character(),
iso_code = col_character(),
location = col_character(),
total_cases_per_million = col_double(),
total_deaths_per_million = col_double(),
hosp_patients_per_million = col_double(),
total_tests_per_thousand = col_double()
)
) %>%
select(date, iso_code, location,
total_cases_per_million,
total_deaths_per_million,
hosp_patients_per_million,
total_tests_per_thousand) %>%
filter(
iso_code %in% c("AUT", "BEL", "BGR", "CZE", "DNK"),
date == "2020-10-20"
) %>%
mutate(
total_tests_per_thousand = total_tests_per_thousand * 1e03
# turn 'per thousand' to 'per million'
) %>%
pivot_longer(
cols = total_cases_per_million:total_tests_per_thousand,
names_pattern = "(.*)_per",
names_to = "variable"
) %>%
mutate(
variable = factor(
variable,
levels = c("total_cases", "total_deaths",
"hosp_patients", "total_tests"),
# add labels so that the legend items in the plot make more sense
label = c("Cases", "Deaths", "Hospitalizations", "Tests"),
# order the factor levels to control for the order of variables
# displayed in the bar plot
ordered = TRUE)
)| date | iso_code | location | variable | value |
|---|---|---|---|---|
| 2020-10-20 | AUT | Austria | Cases | 7395.963 |
| 2020-10-20 | AUT | Austria | Deaths | 102.039 |
| 2020-10-20 | AUT | Austria | Hospitalizations | 82.608 |
| 2020-10-20 | AUT | Austria | Tests | 218961.000 |
| 2020-10-20 | BEL | Belgium | Cases | 22606.875 |
| date | iso_code | location | variable | value |
|---|---|---|---|---|
| 0/20 | 0/20 | 0/20 | 0/20 | 0/20 |
Using library ggplotly, we can easily add interactivity to a ggplot object, and the interactivity will be retained in an HTML output of Rmarkdown.
Try the following:
plot3_title = "Comparison of # of Cases, Deaths, Hospitalizations, and Testing"
plot3_base = plot3_data %>%
ggplot(aes(x = location, y = value,
fill = variable)) +
theme_bw() +
geom_col(position = "dodge", width = .7) +
scale_fill_brewer(
name = "Total Number",
palette = "Paired"
) +
scale_x_discrete(name = "Countries") +
scale_y_continuous(name = "# per Mil.") +
ggtitle(plot3_title) +
theme(
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 0.5, size = 12),
axis.title = element_text(size = 10)
)
plot3_base %>%
plotly::ggplotly()Due to the large data range, we use a log10 scale to transform the y-axis value to ensure all variables of interest are visible.
Note that when the mouse hovers over a bar, the value presented will still be the original value before scaling. That way we can know the exact value without referencing y-axis labels.
# transform the y-axis values using a log10 scale
{plot3_base +
scale_y_continuous(
name = "log10(# per Mil.)",
trans = "log10"
) +
theme(
axis.text.y = element_blank() # remove the y-axis label
# because they are not helpful due to the large data range
)
} %>%
plotly::ggplotly()Note that the legend position is higher than static plots. We can move the legend items downwards using plotly::layout(), and move the legend title downwards by (i) hiding the original legend title, and (ii) add a new legend title using plotly::add_annotations.
{plot3_base +
scale_y_continuous(name = "log10(# per Mil.)", trans = "log10") +
theme(
axis.text.y = element_blank(),
legend.title = element_blank() # hide the legend title
)
} %>%
plotly::ggplotly() %>%
# move legend downwards
plotly::layout(legend=list(y=0.8, yanchor="top")) %>%
# add the legend title back but to a lower position
plotly::add_annotations(
text="Total Number",
xref="paper", x=1.02, xanchor="left",
yref="paper", y=0.8, yanchor="bottom",
legendtitle=TRUE,
showarrow=FALSE
)In this tutorial, we’ll go over 2 different ways we can introduce interactive feature with MATLAB. 1.) Clickable Legend Toolbox is a MATLAB FileExchange toolbox that allows users to hide or unhide data points by clicking on the grouping variable in the legend. 2.) plotly-graphing-library-for-matlab is a toolbox written by Plotly distributed through github, allowing MATLAB users to create and export interactive charts in the web brower by incorporating some of the same functionality of Plotly offered in R and Python. However, it should be noted that the functionality built into MATLAB plotly is much more restricted than those available in R and Python. After the preprocessing step, the tutorial will be divided to show these 2 different methods.
%% Data Prep
addpath('clickable_legend')
fileName = 'owid-covid-data.csv';
covid_orig = readtable(fileName);
% Select 2020-10-20 Data
newestDate = covid_orig.date(217);
covid_hosp = covid_orig(covid_orig.date == newestDate,:);
% Keep only countries that have hospitalization data
covid_hosp = covid_hosp(~isnan(covid_hosp.hosp_patients_per_million),:);
% Plotting the following variables:
% # of deaths, # of cases, # of tests, # hospitalizations
voi = {'location', 'total_cases_per_million', ...
'total_deaths_per_million','hosp_patients_per_million',...
'total_tests_per_thousand'};
% Update dataset to only include variables of interest
covid_hosp = covid_hosp(:, voi);
% Convert total_tests_per_thousand to total_tests_per_million
covid_hosp(:,'total_tests_per_thousand') = ...
table(1000*covid_hosp.total_tests_per_thousand);
covid_hosp.Properties.VariableNames{5} = 'total_tests_per_million';Clickable Legend Toolbox was downloaded from MATLAB FileExchange in order to make the legend interactive. It should be noted that the original toolbox had a small bug associated with a coloring issue in the function (togglevisibility) that had to be corrected. The updated version of the toolbox is available in the project repository on github.
Clickable Legend functionality can be added to any plots that has a multiple groups that can create a legend. In our case, we use bar graphs to compare the number of cases, deaths, hospitalization, and testing across 5 different European countries. However, it can be added to line plots and scatter plots (by getting rid of the line and specifying marker shape and size).
%% Plotting
nCountries = 5;
X = categorical(covid_hosp.location(1:nCountries));
Y = [covid_hosp{1:nCountries,2}, covid_hosp{1:nCountries,3},...
covid_hosp{1:nCountries,4}, covid_hosp{1:nCountries,5}];
figure
bar(X,Y)Once the base plot has been plotted, we add the interactive functionality by calling function ‘clickableLegend’ and enable the toggling of the visibility of the bars by clicking on the text shown in the legend. In calling the function (clickableLegend), we specify the name of each legend entry and the optional location of the legend, through passing argument pair, (‘Location’, ‘eastoutside’).
Note: The y-axis was log scaled for better comparison of values across different statistics.
clickableLegend({'Case','Death','Hosp','Test'}, 'Location', 'eastoutside');
title('Comparison of Cases, Deaths, Hospitaliations, and Testing')
xlabel('Countries')
ylabel('log10(# per million)')
set(gca, 'YScale', 'log')Unfortunately, MATLAB figure created this way does not retain its interactive functionality through an html export, to see it in action, the example code will have to be executed in MATLAB. The .m script of this code is available in the project github repository. Alternatively, follow the instructions using plotly for MATLAB to easily export interactive html figure.
To use plotly for MATLAB, we download the github repository as listed above and add it to our MATLAB path. If this is your first time trying to use plotly on MATLAB, you’ll have to run a single line command, plotlysetup('your_username', 'your_api_key') with the username and API key that you obtain from plotly website. Further instructions are included in the plotly for matlab github repository.
addpath('./plotly-graphing-library-for-matlab-master');
To create an interactive grouped bar chart, each of the variables being plotted, in our case, # of cases, deaths, hospitalizations, and tests, have to be organized into structs with options specifying the overall grouping variable (by country), variable name, type of plot (bar). Additional paramters, like the color of the groups can be added to customize the figure.
countries = covid_hosp.location(1 : nCountries);
logY = log10(Y);
trace1 = struct(...
'x', { countries }, ...
'y', logY(:,1), ...
'name', 'Total # Cases', ...
'marker', struct('color', 'rgb(166, 206, 227)'),...
'type', 'bar');
trace2 = struct(...
'x', { countries }, ...
'y', logY(:,2), ...
'name', 'Total # Deaths', ...
'marker', struct('color', 'rgb(29, 124, 180)'),...
'type', 'bar');
trace3 = struct(...
'x', { countries }, ...
'y', logY(:,3), ...
'name', 'Total # Hospitalizations', ...
'marker', struct('color', 'rgb(178, 223, 138)'),...
'type', 'bar');
trace4 = struct(...
'x', { countries }, ...
'y', logY(:,4), ...
'name', 'Total # Tested', ...
'marker', struct('color', 'rgb(52, 164, 44)'),...
'type', 'bar');
We finally combine all of the structs for each variable then pass this combined struct array into yet another struct which puts it all together with other parameters like the title, yaxis, xaxis, legend. With the plotly function, we generate the figure (within MATLAB window) and are also provided a url for this figure which is automatically uploaded to your plotly account.
data = {trace1, trace2, trace3, trace4};
layout = struct('title', ...
'Comparison of # of Cases, Deaths, Hospitalizations, and Testing',...
'yaxis', struct(...
'title', 'log10(# per million)'),...
'xaxis', struct(...
'title', 'Countries'),...
'barmode', 'group', ...
'legend', struct(...
'x', 1.0, ...
'y', 1.0, ...
'bgcolor', 'rgba(255, 255, 255, 0)', ...
'bordercolor', 'rgba(255, 255, 255, 0)'));
response = plotly(data, struct('layout', layout, ...
'filename', 'grouped-bar', 'fileopt', 'overwrite'));
plot_url = response.url;
We can then go to the plot_url to export the figure as an html figure on top of the figure generated within MATLAB that is already interactive.
We use the python package plotly to implement our interactive plot. The package contains a lot of helpful functions which enable the user to interact with elements in the plot.
import numpy as np
import plotly.express as px
import plotly.offline as offline
import plotly.graph_objs as goSimilar to the Bubble Plot example, we still use the data of Oct 20, 2020.
# will be using the same dataset as in the bubble plot section
new3 = df2[df2['date'] == '2020-10-20']
new3 = new3[['location', 'total_cases_per_million', 'total_deaths_per_million',
'hosp_patients_per_million', 'total_tests_per_thousand']]
new3 = new3.dropna(axis=0, how='any')[0:5]Here we first use the go.Bar() function to create 4 traces which represent the 4 categories in our plot: total_cases_per_million, total_deaths_per_million, hosp_patients_per_million, and total_tests_per_thousand. Inside the go.Bar() function, we specify the variable and its color. Later we are going to combine these four traces and make the plot using go.Figure() function. Meanwhile, the corresponding legend for this plot will be generated automatically.
branches = new3['location']
total_cases_per_million = np.log10(new3['total_cases_per_million'])
total_deaths_per_million = np.log10(new3['total_deaths_per_million'])
hosp_patients_per_million = np.log10(new3['hosp_patients_per_million'])
total_tests_per_thousand = np.log10(new3['total_tests_per_thousand']*1000)
trace1 = go.Bar(
x = branches,
y = total_cases_per_million,
name = 'Cases',
marker = {'color': 'lightblue'}
)
trace2 = go.Bar(
x = branches,
y = total_deaths_per_million,
name = 'Deaths',
marker = {'color': 'steelblue'}
)
trace3 = go.Bar(
x = branches,
y = hosp_patients_per_million,
name = 'Hospitalizations',
marker = {'color': 'darkseagreen'}
)
trace4 = go.Bar(
x = branches,
y = total_tests_per_thousand,
name = 'Tests',
marker = {'color': 'forestgreen'}
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
title = 'Comparison of Cases, Deaths, Hospitalizations, and Testing',
barmode = 'group')
fig = go.Figure(data = data, layout = layout)
fig.update_xaxes(title_text='Countries');
fig.update_yaxes(title_text='log10(# per Mil.)');
fig.update_layout;
offline.plot(fig, filename='interactive.html');Data visualization can prove to be an immensely useful tool when trying to understand and organize large amounts of data. In this tutorial, we reviewed a few graphical concepts, marginal histogram, bubble plot, and interactive plots to demonstrate the functionality of these different graphical concepts and how they may facilitate a better grasp of the pattern captured in the data.
The tutorial is carried out in 4 different programming languages (R, STATA, Python, MATLAB). When it comes to data visualization, the two open-source software, R and Python take the clear lead in flexibility and functionality over the two proprietary languages, STATA and MATLAB. The open-source nature of R and Python has allowed the development of a vast number of user written libraries capable of producing publication quality plots with large customizability of the plots.
However, at the cost of less customizability on part of the user, plotting in MATLAB and STATA tends to be a little more succinct and straightforward, with many default parameters in place that do not need to be specified. We also find that STATA’s macros come in really handy when we want to unify settings across multiple figures without writing duplicated code.
However, we would note that once we begin to try to implement the same level of specifications in the plots as these open-source software, MATLAB and STATA lose this advantage. For example, with R’s ggplot2, we can easily change aesthetic options. If we want to enlarge the symbol size in a color legend while keeping the symbols in the main plot untouched, simply using guides(color = guide_legend(override.aes = list(size = 3))) will do. The same task can take much more effort in STATA: we have to create another layer of scatterplot that does not have any valid data points (so the plotting area will be empty) but has larger symbol size (for the legend). See the code example below:
sysuse auto
twoway /// the 3rd and 4th layers are empty scatter plots with larger msize
(scatter mpg weight if foreign==0, msymbol(o) mcolor(orange)) ///
(scatter mpg weight if foreign==1, msymbol(o) mcolor(green)) ///
(scatter mpg weight if mpg==., msymbol(o) mcolor(orange) msize(*3)) ///
(scatter mpg weight if mpg==., msymbol(o) mcolor(green) msize(*3)), ///
legend(order(3 4) label(3 domestic) label(4 foreign))Python contains a lot of libraries, which allows different implementations of the same plot type. For instance, the bubble plot can be implemented using either matplotlib or Seaborn. Interestingly, the matplotlib package inherits lots of nice properties from MATLAB. Hence it is possible to make plots similar to MATLAB using this package.
It is a little bit difficult to implement complicated legend in Python using matplotlib (e.g. the size legend in the bubble plot). The same graph in Python can take more lines of code than using MATLAB. It is also worth noting that Python programs become much slower when the data size gets larger.